Machine Learning Notes
Table of Contents
Statistical Learning Methods
Introduction to Statistical Learning Methods
Basic Concepts
The Three Elements of Statistical Learning
Model Evaluation and Model Selection
Machine Learning
Loading Datasets
Loading Data
Splitting the Dataset
Learning and Predicting
Example
Saving a Model
Nearest Neighbors
Plotting the Decision Boundary
Model Selection: Choosing the Model and Its Parameters
The Curse of Dimensionality
Perceptron
K-Nearest Neighbors
Linear Models
Logistic Regression (LogisticRegression)
AdaBoost
Support Vector Machines (SVM)
Clustering
Decompositions: Dimensionality Reduction
Pipelining
Type Casting
1. Unless otherwise specified, input is cast to float64
2. Regression targets are cast to float64; classification targets are preserved
Numpy
1. array vs list
2. np.unique(iris_y)
Pandas
Example
Working With Text Data - 20 newsgroups dataset
Statistical Learning Methods
Introduction to Statistical Learning Methods
Basic Concepts
The data are assumed to be generated i.i.d. from a joint probability distribution P(X, Y). That X and Y have a joint probability distribution is the basic assumption of supervised learning.
A model in the hypothesis space can be represented in two ways: as a decision function Y = f(X) or as a conditional probability distribution P(Y|X).
The Three Elements of Statistical Learning
Method = Model + Strategy + Algorithm
Model
The hypothesis space can be defined as a set of decision functions F = {f | Y = f_θ(X), θ ∈ Θ}, or as a set of conditional probability distributions F = {P | P_θ(Y|X), θ ∈ Θ}, where the parameter vector θ takes values in an n-dimensional parameter space Θ.
Strategy
The loss function is usually written L(Y, f(X)); the common loss functions are:
0-1 loss: L(Y, f(X)) = 1 if Y ≠ f(X), and 0 if Y = f(X)
Squared loss: L(Y, f(X)) = (Y − f(X))^2
Absolute loss: L(Y, f(X)) = |Y − f(X)|
Log loss (log-likelihood loss): L(Y, P(Y|X)) = −log P(Y|X)
Exponential loss: L(y, f(x)) = exp(−y · f(x))
(Comparison figure of the loss functions omitted.)
Expected risk: R_exp(f) = E_P[L(Y, f(X))] = ∫ L(y, f(x)) P(x, y) dx dy
Empirical risk: R_emp(f) = (1/N) Σ_{i=1}^{N} L(yi, f(xi))
Structural risk: R_srm(f) = (1/N) Σ_{i=1}^{N} L(yi, f(xi)) + λ · J(f)
Structural risk minimization was proposed to prevent overfitting and is equivalent to regularization. J(f) measures the complexity of the model: the more complex the model, the larger J(f); the simpler the model, the smaller J(f). The maximum a posteriori (MAP) estimate in Bayesian estimation is an example of structural risk minimization.
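A tiny numeric sketch of structural risk: the empirical risk (mean squared loss here) plus a complexity penalty λ·J(f), with J(f) taken as the squared norm of the coefficients as in ridge regression; all numbers below are made up for illustration.
import numpy as np

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.9, 3.2])
coef = np.array([0.5, -0.3])   # hypothetical model coefficients
lam = 0.1                      # regularization strength lambda
empirical_risk = np.mean((y_true - y_pred) ** 2)
structural_risk = empirical_risk + lam * np.sum(coef ** 2)
print(empirical_risk, structural_risk)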
Algorithm
According to the chosen strategy, select the optimal model and its parameters from the hypothesis space.
Model Evaluation and Model Selection
Training error and test error
The error is computed with a specific loss function, e.g. the 0-1 loss or the squared loss; see the sketch below.
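A small sketch of training error versus test error under the 0-1 loss; the iris split and the k-nearest-neighbour model are arbitrary illustration choices.
import numpy as np
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
rng = np.random.RandomState(0)
idx = rng.permutation(len(iris.data))
X_train, y_train = iris.data[idx[:-30]], iris.target[idx[:-30]]
X_test, y_test = iris.data[idx[-30:]], iris.target[idx[-30:]]

clf = KNeighborsClassifier().fit(X_train, y_train)
train_error = np.mean(clf.predict(X_train) != y_train)  # empirical risk under the 0-1 loss
test_error = np.mean(clf.predict(X_test) != y_test)     # estimate of the expected risk
print(train_error, test_error)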
Machine Learning
Loading Datasets
Loading Data
from sklearn import datasets
iris = datasets.load_iris()
Splitting the Dataset
Random permutation
import numpy as np
iris_X = iris.data
iris_y = iris.target
np.random.seed(0)
indices = np.random.permutation(len(iris_X))  # a random permutation of the sample indices
iris_X_train = iris_X[indices[:-10]]
iris_y_train = iris_y[indices[:-10]]
iris_X_test = iris_X[indices[-10:]]
iris_y_test = iris_y[indices[-10:]]
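The same split can also be done with the train_test_split helper; a sketch, assuming the old sklearn.cross_validation module used elsewhere in these notes (newer versions move it to sklearn.model_selection), with test_size=10 chosen to match the manual split above.
from sklearn.cross_validation import train_test_split
iris_X_train, iris_X_test, iris_y_train, iris_y_test = train_test_split(
    iris_X, iris_y, test_size=10, random_state=0)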
Learning and Predicting
Example
from sklearn import datasets, svm
digits = datasets.load_digits()
clf = svm.SVC(gamma=0.001, C=100)
clf.fit(digits.data[:-1], digits.target[:-1])
print(clf)
print(clf.predict(digits.data[-1:]))
Updating parameters (sklearn.pipeline.Pipeline.set_params)
clf.set_params(kernel='linear').fit(X, y)
clf.set_params(kernel='rbf').fit(X, y)
Saving a Model
import pickle
s = pickle.dumps(clf)
clf2 = pickle.loads(s)
print(digits.target[-2])
print(clf2.predict(digits.data[-2:-1]))
# 2
from sklearn.externals import joblib
joblib.dump(clf, 'filename.pkl')
clf = joblib.load('filename.pkl')
Nearest Neighbors
from sklearn.neighbors import KNeighborsClassifier
>>> knn = KNeighborsClassifier()
>>> knn.fit(iris_X_train, iris_y_train)
Plotting the Decision Boundary
# setup following the scikit-learn nearest-neighbors plotting example: iris, first two features
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import datasets, neighbors

n_neighbors = 15
iris = datasets.load_iris()
X = iris.data[:, :2]  # keep only the first two features so the boundary can be drawn in 2D
y = iris.target

# color maps
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])
for weights in ['uniform', 'distance']:
    # we create an instance of Neighbours Classifier and fit the data.
    clf = neighbors.KNeighborsClassifier(n_neighbors, weights=weights)
    clf.fit(X, y)
    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    h = .02
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    print(xx)
    # Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])  # equivalent to the line below
    Z = clf.predict(np.array([xx.ravel(), yy.ravel()]).T)
    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    print(xx.shape)
    plt.figure()
    # fill the mesh with the predicted class colors
    plt.pcolormesh(xx, yy, Z, cmap=cmap_light)
    # plt.contourf(xx, yy, Z, alpha=0.4, cmap=cmap_light)
    # plot the training points
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold)
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.title("3-Class classification (k = %i, weights = '%s')" % (n_neighbors, weights))
plt.show()
Model Selection: Choosing the Model and Its Parameters
Score, and cross-validated scores
Every estimator exposes a score method that judges the quality of the fit (or the prediction) on new data: bigger is better.
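A minimal sketch of the score method on the digits data; the names X_digits, y_digits and svc defined here are reused by the cross-validation snippets below, and C=1 with a linear kernel is just one reasonable choice.
from sklearn import datasets, svm

digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target
svc = svm.SVC(C=1, kernel='linear')
# fit on all but the last 100 samples, score on the held-out 100; bigger is better
print(svc.fit(X_digits[:-100], y_digits[:-100]).score(X_digits[-100:], y_digits[-100:]))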
Cross-validation
from sklearn import cross_validation
kfold = cross_validation.KFold(len(X_digits), n_folds=3)
>>> [svc.fit(X_digits[train], y_digits[train]).score(X_digits[test], y_digits[test]) for train, test in kfold]
[0.93489148580968284, 0.95659432387312182, 0.93989983305509184]
>>> cross_validation.cross_val_score(svc, X_digits, y_digits, cv=kfold, n_jobs=-1)
array([ 0.93489149, 0.95659432, 0.93989983])
Cross-validation generators
KFold(n, k) | Split into K folds, train on K-1 of them, then test on the left-out fold
StratifiedKFold(y, k) | Preserves the class ratios / label distribution within each fold
LeaveOneOut(n) | Leave one observation out
LeaveOneLabelOut(labels) | Takes a label array to group observations
Grid-search and cross-validated estimators
import numpy as np
from sklearn.grid_search import GridSearchCV
Cs = np.logspace(-6, -1, 10)
clf = GridSearchCV(estimator=svc, param_grid=dict(C=Cs),n_jobs=-1)
>>> clf.fit(X_digits[:1000], y_digits[:1000])
GridSearchCV(cv=None,...
>>> clf.best_score_
0.925...
>>> clf.best_estimator_.C
0.0077...
>>> # Prediction performance on test set is not as good as on train set
>>> clf.score(X_digits[1000:], y_digits[1000:])
0.943...
By default GridSearchCV uses 3-fold cross-validation. If the estimator is a classifier rather than a regressor, it uses StratifiedKFold so that each fold keeps the same label proportions.
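As a sketch, the grid-searched estimator can itself be evaluated with an outer cross-validation loop (nested cross-validation); clf here is the GridSearchCV object fitted above.
print(cross_validation.cross_val_score(clf, X_digits, y_digits))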
Cross-validated estimators
from sklearn import linear_model, datasets
lasso = linear_model.LassoCV()
diabetes = datasets.load_diabetes()
X_diabetes = diabetes.data
y_diabetes = diabetes.target
lasso.fit(X_diabetes, y_diabetes)
LassoCV(alphas=None, copy_X=True, cv=None, eps=0.001, fit_intercept=True,
max_iter=1000, n_alphas=100, n_jobs=1, normalize=False, positive=False,
precompute='auto', random_state=None, selection='cyclic', tol=0.0001,
verbose=False)
>>> # The estimator chose automatically its lambda:
>>> lasso.alpha_
0.01229...
The Curse of Dimensionality
First, Error = Bias + Variance (more precisely, the expected squared error decomposes into bias², variance, and irreducible noise).
Error reflects the overall accuracy of the model. Bias is the gap between the model's outputs on the samples and the true values, i.e. the intrinsic accuracy of the model. Variance is the spread between each output of the model and the expected output, i.e. the stability of the model.
A rough rule of thumb: N = 10 · d training samples (d = number of dimensions).
A strict requirement: N = 10^d samples (e.g. d = 10 already demands 10^10 samples).
Perceptron
Perceptron Learning Algorithm
The relationship between PLA and SGD:
Perceptron and SGDClassifier share the same underlying implementation. In fact, Perceptron() is equivalent to SGDClassifier(loss="perceptron", eta0=1, learning_rate="constant", penalty=None).
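A small check of the quoted equivalence; a sketch, assuming two-class iris data and a fixed random_state. Since default settings differ slightly across scikit-learn versions, the learned coefficients are printed for comparison rather than asserted equal.
from sklearn import datasets
from sklearn.linear_model import Perceptron, SGDClassifier

iris = datasets.load_iris()
X, y = iris.data[iris.target < 2], iris.target[iris.target < 2]

pla = Perceptron(random_state=0).fit(X, y)
sgd = SGDClassifier(loss="perceptron", eta0=1, learning_rate="constant",
                    penalty=None, random_state=0).fit(X, y)
print(pla.coef_)
print(sgd.coef_)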
K-Nearest Neighbors
- There is no explicit learning step; classification is done by majority voting among the k nearest neighbours.
- The model amounts to a partition of the feature-vector space induced by the training data.
- Three elements: the distance metric, the choice of k, and the classification (decision) rule.
- The larger k is, the simpler the model; the smaller k is, the more complex the model and the easier it is to overfit.
- The majority-voting rule is equivalent to empirical risk minimization.
- A linear scan is too slow; a kd-tree is used to speed up the search (some data-structure background required); see the sketch after this list.
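A minimal sketch of switching the neighbour search to a kd-tree in scikit-learn; the names iris_X_train, iris_y_train, iris_X_test reuse the split from the earlier section, and n_neighbors=5 is an arbitrary choice.
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5, algorithm='kd_tree')  # other options: 'auto', 'ball_tree', 'brute'
knn.fit(iris_X_train, iris_y_train)
print(knn.predict(iris_X_test))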
Linear Models
from sklearn import datasets, linear_model
diabetes = datasets.load_diabetes()
diabetes_X_train = diabetes.data[:-20]
diabetes_X_test = diabetes.data[-20:]
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]
regr = linear_model.LinearRegression()
>>> regr.fit(diabetes_X_train, diabetes_y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
>>> print(regr.coef_)
[ 0.30349955 -237.63931533 510.53060544 327.73698041 -814.13170937
492.81458798 102.84845219 184.60648906 743.51961675 76.09517222]
>>> # The mean squared error
>>> np.mean((regr.predict(diabetes_X_test)-diabetes_y_test)**2)
2004.56760268...
>>> # Explained variance score: 1 is perfect prediction
>>> # and 0 means that there is no linear relationship
>>> # between X and Y.
>>> regr.score(diabetes_X_test, diabetes_y_test)
0.5850753022690...
Shrinkage
Setting where this appears: few data points per dimension, with high-variance noise.
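The snippets below assume the two-point toy data from the scikit-learn tutorial and pylab imported as pl; a minimal setup sketch.
import numpy as np
import pylab as pl
from sklearn import linear_model

X = np.c_[.5, 1].T     # two observations, one feature
y = [.5, 1]
test = np.c_[0, 2].T   # points at which to draw the fitted line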
regr = linear_model.LinearRegression()
regr.fit(X, y)
pl.plot(test, regr.predict(test))
Solution:
Shrink the regression coefficients toward zero. The bias that ridge regression introduces is in fact a form of regularization. Fitting the noise so that the model fails to generalize to new data is called overfitting.
regr = linear_model.Ridge(alpha=.1)
pl.figure()
np.random.seed(0)
for _ in range(6):
    this_X = .1 * np.random.normal(size=(2, 1)) + X
    regr.fit(this_X, y)
    pl.plot(test, regr.predict(test))
    pl.scatter(this_X, y, s=3)
alphas = np.logspace(-4, -1, 6)
from __future__ import print_function
>>> print([regr.set_params(alpha=alpha).fit(diabetes_X_train, diabetes_y_train,).score(diabetes_X_test, diabetes_y_test) for alpha in alphas])
[0.5851110683883..., 0.5852073015444..., 0.5854677540698..., 0.5855512036503..., 0.5830717085554..., 0.57058999437...]
A classic bias/variance tradeoff: the larger the ridge alpha parameter, the higher the bias and the lower the variance.
Sparsity
Example: the diabetes dataset involves 11 dimensions, so it is hard to extract useful information by visualization alone, but it may be important to keep in mind that the data probably occupy a rather empty space.
Sparsity means keeping only the informative features and setting the coefficients of uninformative features to zero. Ridge regression shrinks coefficients but not all the way to zero; Lasso (least absolute shrinkage and selection operator) sets them to zero. Such methods are called sparse methods, and this is an application of Occam's razor: prefer simpler models.
regr = linear_model.Lasso()
scores = [regr.set_params(alpha=alpha).fit(diabetes_X_train, diabetes_y_train).score(diabetes_X_test, diabetes_y_test)for alpha in alphas]
best_alpha = alphas[scores.index(max(scores))]
regr.alpha = best_alpha
>>> regr.fit(diabetes_X_train, diabetes_y_train)
Lasso(alpha=0.025118864315095794, copy_X=True, fit_intercept=True,
max_iter=1000, normalize=False, positive=False, precompute=False,
random_state=None, selection='cyclic', tol=0.0001, warm_start=False)
>>> print(regr.coef_)
[ 0. -212.43764548 517.19478111 313.77959962 -160.8303982 -0.
-187.19554705 69.38229038 508.66011217 71.84239008]
Lasso -> uses a coordinate descent method, which is efficient on large datasets.
LassoLars -> uses the LARS algorithm, which is very efficient for problems in which the estimated weight vector is very sparse.
In LogisticRegression, C controls the amount of regularization: a large value of C results in less regularization. penalty="l2" gives Shrinkage (i.e. non-sparse coefficients), while penalty="l1" gives Sparsity.
Logistic Regression (LogisticRegression)
Example: for the iris task, linear regression is not the right approach, because it gives too much weight to data far from the decision boundary. A better approach is to fit a logistic function (which is closer to a step function); see the sketch below.
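A minimal sketch of fitting LogisticRegression on iris; C=1e5 (weak regularization) follows the tutorial example, and the note on C and penalty above applies here.
from sklearn import datasets, linear_model

iris = datasets.load_iris()
logistic = linear_model.LogisticRegression(C=1e5)
logistic.fit(iris.data, iris.target)
print(logistic.predict(iris.data[:3]))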
AdaBoost
The boosting method: the AdaBoost algorithm
Algorithm (AdaBoost)
Input: training data set T = {(x1, y1), (x2, y2), ..., (xN, yN)} with yi ∈ {-1, +1}, and a weak learning algorithm;
Output: the final classifier G(x).
(1) Initialize the weight distribution over the training data:
D1 = (w11, ..., w1i, ..., w1N), w1i = 1/N, i = 1, 2, ..., N
(2) For m = 1, 2, ..., M:
(a) Learn a base classifier Gm(x) from the training data weighted by the distribution Dm.
(b) Compute the classification error rate of Gm(x) on the weighted training data set:
em = P(Gm(xi) ≠ yi) = Σ_{i=1}^{N} wmi · I(Gm(xi) ≠ yi)
(c) Compute the coefficient of Gm(x):
αm = (1/2) · ln((1 − em) / em)
where the logarithm is the natural logarithm.
(d) Update the weight distribution over the training data:
Dm+1 = (wm+1,1, ..., wm+1,i, ..., wm+1,N)
wm+1,i = (wmi / Zm) · exp(−αm · yi · Gm(xi)), i = 1, 2, ..., N
Here Zm = Σ_{i=1}^{N} wmi · exp(−αm · yi · Gm(xi)) is a normalization factor, which makes Dm+1 a probability distribution.
(3) Build the linear combination of the base classifiers
f(x) = Σ_{m=1}^{M} αm · Gm(x)
and obtain the final classifier G(x) = sign(f(x)).
Remarks on the AdaBoost algorithm:
Step (1) assumes a uniform weight distribution over the training data, i.e. every training sample plays the same role when learning the first base classifier;
this assumption guarantees that the first round learns the base classifier G1(x) on the original data.
Step (2): AdaBoost learns base classifiers repeatedly; in each round m = 1, 2, ..., M it performs the following operations:
(a) Learn the base classifier Gm(x) from the training data weighted by the current distribution Dm.
(b) Compute the classification error rate of Gm(x) on the weighted training data set:
em = P(Gm(xi) ≠ yi) = Σ_{i: Gm(xi) ≠ yi} wmi
This shows how the weight distribution Dm relates to the classification error rate of the base classifier Gm(x).
(c) Compute the coefficient αm of Gm(x); αm expresses the importance of Gm(x) in the final classifier. When em < 0.5, αm > 0, and αm increases as em decreases, so base classifiers with smaller error rates play a larger role in the final classifier.
(d) Update the weight distribution of the training data to prepare for the next round.
The weights of samples misclassified by the base classifier Gm(x) are enlarged, while the weights of correctly classified samples are shrunk. Compared with each other, the weights of misclassified samples are magnified by a factor of exp(2αm) = (1 − em)/em, so misclassified samples play a larger role in the next round. Without changing the training data itself, AdaBoost keeps changing the weight distribution over the training data so that the data play different roles when learning each base classifier; this is one characteristic of AdaBoost.
Step (3): the linear combination f(x) realizes a weighted vote over the M base classifiers. The coefficient αm expresses the importance of the base classifier Gm(x); note that the αm do not sum to 1. The sign of f(x) decides the class of instance x, and the absolute value of f(x) expresses the confidence of the classification. Building the final classifier as a linear combination of base classifiers is another characteristic of AdaBoost.
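A minimal sketch of AdaBoost in scikit-learn, using decision stumps as base classifiers; the dataset and parameter values are illustrative only, and the base_estimator keyword follows older scikit-learn versions (later renamed to estimator).
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=1),
                         n_estimators=50)
clf.fit(iris.data, iris.target)
print(clf.score(iris.data, iris.target))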
Support Vector Machines (SVM)
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets, svm
iris = datasets.load_iris()
X = iris.data
y = iris.target
X = X[y != 0, :2]
y = y[y != 0]
n_sample = len(X)
np.random.seed(0)
order = np.random.permutation(n_sample)
X = X[order]
y = y[order].astype(np.float)
n_train = int(.9 * n_sample)
X_train = X[:n_train]
y_train = y[:n_train]
X_test = X[n_train:]
y_test = y[n_train:]
# fit the model
for fig_num, kernel in enumerate(('linear', 'rbf', 'poly')):
    clf = svm.SVC(kernel=kernel, gamma=10)
    clf.fit(X_train, y_train)
    plt.figure(fig_num)
    plt.clf()
    plt.scatter(X[:, 0], X[:, 1], c=y, zorder=10, cmap=plt.cm.Paired)
    # Circle out the test data
    plt.scatter(X_test[:, 0], X_test[:, 1], s=80, facecolors='none', zorder=10)
    plt.axis('tight')
    x_min = X[:, 0].min()
    x_max = X[:, 0].max()
    y_min = X[:, 1].min()
    y_max = X[:, 1].max()
    XX, YY = np.mgrid[x_min:x_max:200j, y_min:y_max:200j]
    Z = clf.decision_function(np.c_[XX.ravel(), YY.ravel()])
    # Put the result into a color plot
    Z = Z.reshape(XX.shape)
    plt.pcolormesh(XX, YY, Z > 0, cmap=plt.cm.Paired)
    plt.contour(XX, YY, Z, colors=['k', 'k', 'k'], linestyles=['--', '-', '--'],
                levels=[-.5, 0, .5])
    plt.title(kernel)
plt.show()
Clustering
k-means
Hierarchical agglomerative clustering: Ward
Agglomerative - bottom-up
Divisive - top-down
import time
import numpy as np
import scipy as sp
from sklearn.feature_extraction.image import grid_to_graph
from sklearn.cluster import AgglomerativeClustering
# Generate data
lena = sp.misc.lena()
# Downsample the image by a factor of 4
lena = lena[::2, ::2] + lena[1::2, ::2] + lena[::2, 1::2] + lena[1::2, 1::2]
X = np.reshape(lena, (-1, 1))
###############################################################################
# Define the structure A of the data. Pixels connected to their neighbors.
connectivity = grid_to_graph(*lena.shape)
###############################################################################
# Compute clustering
print("Compute structured hierarchical clustering...")
st = time.time()
n_clusters = 15 # number of regions
ward = AgglomerativeClustering(n_clusters=n_clusters, linkage='ward',
                               connectivity=connectivity).fit(X)
label = np.reshape(ward.labels_, lena.shape)
print("Elapsed time: ", time.time() - st)
print("Number of pixels: ", label.size)
print("Number of clusters: ", np.unique(label).size)
Connectivity-constrained clustering
Feature agglomeration -> used to reduce the dimensionality of the data
We have already seen that sparsity can be used to mitigate the curse of dimensionality, i.e. an insufficient number of observations compared with the number of features. Another approach is to merge similar features together: feature agglomeration. It can be implemented by clustering in the feature direction, in other words, by clustering the transposed data.
import numpy as np
from sklearn import cluster, datasets
digits = datasets.load_digits()
images = digits.images
X = np.reshape(images, (len(images), -1))
connectivity = grid_to_graph(*images[0].shape)
agglo = cluster.FeatureAgglomeration(connectivity=connectivity, n_clusters=32)
>>> agglo.fit(X)
FeatureAgglomeration(affinity='euclidean', compute_full_tree='auto',...
X_reduced = agglo.transform(X)
X_approx = agglo.inverse_transform(X_reduced)
>>> images_approx = np.reshape(X_approx, images.shape)
Decompositions: Dimensionality Reduction
PCA
The point cloud spanned by the observations above is very flat in one direction: one of the three univariate features can almost be exactly computed using the other two. PCA finds the directions in which the data is not flat.
In other words: of the three dimensions, one is particularly flat and can almost be computed from the other two; PCA finds the directions in which the data are not flat.
>>> # Create a signal with only 2 useful dimensions
>>> x1 = np.random.normal(size=100)
>>> x2 = np.random.normal(size=100)
>>> x3 = x1 + x2
>>> X = np.c_[x1, x2, x3]
>>> from sklearn import decomposition
>>> pca = decomposition.PCA()
>>> pca.fit(X)
PCA(copy=True, n_components=None, whiten=False)
>>> print(pca.explained_variance_)
[ 2.18565811e+00 1.19346747e+00 8.43026679e-32]
>>> # As we can see, only the 2 first components are useful
>>> pca.n_components = 2
>>> X_reduced = pca.fit_transform(X)
>>> X_reduced.shape
(100, 2)
ICA-> Independent Component Analysis
ICA selects components so that the distribution of their loadings carries a maximum amount of independent information. It is able to recover non-Gaussian independent signals:
That is, ICA chooses components so that the distribution of their loadings carries the maximum amount of independent information, and it can recover non-Gaussian independent signals.
>>> # Generate sample data
>>> time = np.linspace(0, 10, 2000)
>>> s1 = np.sin(2 * time) # Signal 1 : sinusoidal signal
>>> s2 = np.sign(np.sin(3 * time)) # Signal 2 : square signal
>>> S = np.c_[s1, s2]
>>> S += 0.2 * np.random.normal(size=S.shape) # Add noise
>>> S /= S.std(axis=0) # Standardize data
>>> # Mix data
>>> A = np.array([[1, 1], [0.5, 2]]) # Mixing matrix
>>> X = np.dot(S, A.T) # Generate observations
>>> # Compute ICA
>>> ica = decomposition.FastICA()
>>> S_ = ica.fit_transform(X) # Get the estimated sources
>>> A_ = ica.mixing_.T
>>> np.allclose(X, np.dot(S_, A_) + ica.mean_)
True
Pipelining
Example: combining a transform model with a predict model.
The PCA does an unsupervised dimensionality reduction, while the logistic regression does the prediction.
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model, decomposition, datasets
from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV
logistic = linear_model.LogisticRegression()
pca = decomposition.PCA()
pipe = Pipeline(steps=[('pca', pca), ('logistic', logistic)])
digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target
# Plot the PCA spectrum
pca.fit(X_digits)
plt.figure(1, figsize=(4, 3))
plt.clf()
plt.axes([.2, .2, .7, .7])
plt.plot(pca.explained_variance_, linewidth=2)
plt.axis('tight')
plt.xlabel('n_components')
plt.ylabel('explained_variance_')
# Prediction
n_components = [20, 40, 64]
Cs = np.logspace(-4, 4, 3)
#Parameters of pipelines can be set using ‘__’ separated parameter names:
estimator = GridSearchCV(pipe, dict(pca__n_components=n_components,logistic__C=Cs))
estimator.fit(X_digits, y_digits)
plt.axvline(estimator.best_estimator_.named_steps['pca'].n_components,
linestyle=':', label='n_components chosen')
plt.legend(prop=dict(size=12))
Type Casting
1. Unless otherwise specified, input is cast to float64
import numpy as np
from sklearn import random_projection
rng = np.random.RandomState(0)
X = rng.rand(10, 2000)
X = np.array(X, dtype='float32')
>>> X.dtype
dtype('float32')
>>> transformer = random_projection.GaussianRandomProjection()
>>> X_new = transformer.fit_transform(X)
>>> X_new.dtype
dtype('float64')
2. Regression targets are cast to float64; classification targets are preserved
from sklearn import datasets
from sklearn.svm import SVC
iris = datasets.load_iris()
clf = SVC()
clf.fit(iris.data, iris.target)
>>> list(clf.predict(iris.data[:3]))
[0, 0, 0]
clf.fit(iris.data, iris.target_names[iris.target])
>>> list(clf.predict(iris.data[:3]))
['setosa', 'setosa', 'setosa']
Numpy
1. array vs list
Python's list is a built-in data type whose elements need not all have the same type, whereas all elements of a numpy array must have the same type (see the sketch after this list).
2.np.unique(iris_y)
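Small sketches of both points; the dtype printed for the array is platform dependent.
import numpy as np
from sklearn import datasets

mixed = [1, 'a', 3.0]        # a list may hold mixed types
arr = np.array([1, 2, 3])    # an array has a single dtype
print(type(mixed[1]), arr.dtype)

iris_y = datasets.load_iris().target
print(np.unique(iris_y))     # the distinct class labels, expected [0 1 2]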
Pandas
Example
Working With Text Data - 20 newsgroups dataset
Load data
from sklearn.datasets import fetch_20newsgroups
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)
# print twenty_train.target_names
# print len(twenty_train.data)
# print len(twenty_train.filenames)
# print twenty_train.data[0].split('\n')[:3]
# print(twenty_train.target_names[twenty_train.target[0]])  # map the first document's target index to its category name
# print twenty_train.target[:10]
# print twenty_train.target_names
Extracting features - the bag-of-words representation
# X as a numpy array of type float32 would require 10000 x 100000 x 4 bytes = 4GB in RAM
# which is barely manageable on today’s computers.
# Fortunately, most values in X will be zeros since for a given document less than a couple
# thousands of distinct words will be used.
# For this reason we say that bags of words are typically high-dimensional sparse datasets.
# scipy.sparse matrices are data structures that do exactly this,
# and scikit-learn has built-in support for these structures.
# Text preprocessing, tokenizing and filtering of stopwords are included in
# sklearn.feature_extraction.text.CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
print(X_train_counts.shape)
The tf-idf model
# tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
# X_train_tf = tf_transformer.transform(X_train_counts)
# print X_train_tf.shape
# print X_train_tf[0]
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
print(X_train_tfidf.shape)
Classifier
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)
docs_new = ['God is love', 'OpenGL on the GPU is fast']
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)
predicted = clf.predict(X_new_tfidf)
for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, twenty_train.target_names[category]))
pipeline
from sklearn.pipeline import Pipeline
text_clf = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf', MultinomialNB()), ])
text_clf = text_clf.fit(twenty_train.data, twenty_train.target)
Evaluation
import numpy as np
twenty_test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=42)
docs_test = twenty_test.data
predicted = text_clf.predict(docs_test)
print(np.mean(predicted == twenty_test.target))
Other Classifiers
from sklearn.linear_model import SGDClassifier
text_clf = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()),
('clf', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, n_iter=5, random_state=42)), ])
_ = text_clf.fit(twenty_train.data, twenty_train.target)
predicted = text_clf.predict(docs_test)
print(np.mean(predicted == twenty_test.target))
Confusion matrix - detailed performance analysis
from sklearn import metrics
print(metrics.classification_report(twenty_test.target, predicted, target_names=twenty_test.target_names))
print(metrics.confusion_matrix(twenty_test.target, predicted))
Parameter tuning
GridSearchCV
from sklearn.grid_search import GridSearchCV
parameters = {'vect__ngram_range': [(1, 1), (1, 2)],'tfidf__use_idf': (True, False),'clf__alpha': (1e-2, 1e-3), }
gs_clf = GridSearchCV(text_clf, parameters, n_jobs=1)
gs_clf = gs_clf.fit(twenty_train.data[:400], twenty_train.target[:400])
print(twenty_train.target_names[gs_clf.predict(['God is love'])[0]])
best_parameters, score, _ = max(gs_clf.grid_scores_, key=lambda x: x[1])
for param_name in sorted(parameters.keys()):
    print("%s: %r" % (param_name, best_parameters[param_name]))
print(score)